{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": true,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "mMulrrXe6_0i"
      },
      "source": [
        "# Pytorch入门实战（4）：基于LSTM实现文本的情感分析"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 本文涉及知识点\n",
        "\n",
        "[Pytorch nn.Module的基本使用](https://blog.csdn.net/zhaohongfei_358/article/details/122797244)\n",
        "\n",
        "[Pytorch nn.Linear的基本用法](https://blog.csdn.net/zhaohongfei_358/article/details/122797190)\n",
        "\n",
        "[Pytorch中DataLoader的基本用法](https://blog.csdn.net/zhaohongfei_358/article/details/122742656)\n",
        "\n",
        "[Pytorch nn.Embedding的基本使用](https://blog.csdn.net/zhaohongfei_358/article/details/122809709)\n",
        "\n",
        "[详解torch.nn.utils.clip_grad_norm_ 的使用与原理](https://blog.csdn.net/zhaohongfei_358/article/details/122820992)"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "TyKcUptm6_0k"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 本文内容\n",
        "\n",
        "本文基于文章[Long Short-Term Memory: From Zero to Hero with PyTorch](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/)的代码，对该文章代码进行了一些修改和注释添加。该文章详细的介绍了LSTM，如果对LSTM不熟悉的朋友，可以先看下改文章。\n",
        "\n",
        "本文使用的亚马逊评论数据集，训练一个可以判别文本情感的分类器。\n",
        "\n",
        "数据集如下：\n",
        "\n",
        "```\n",
        "链接：https://pan.baidu.com/s/1cK-scxLIliTsOPF-6byucQ\n",
        "提取码：yqbq\n",
        "```"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "YTCs253y6_0l"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 数据预处理\n",
        "\n",
        "首先导入要使用的包："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "rv0IV3qr6_0l"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "outputs": [],
      "source": [
        "import bz2 # 用于读取bz2压缩文件\n",
        "from collections import Counter # 用于统计词频\n",
        "import re # 正则表达式\n",
        "import nltk # 文本预处理\n",
        "import numpy as np"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "dcJoLvsE6_0m"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "将数据样本解压到当前目录的data目录下，其中包含两个文件：train.ft.txt.bz2”和“test.ft.txt.bz2”\n",
        "\n",
        "解压后，读取训练数据和测试数据："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "RhSZFZBQ6_0m"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "/usr/local/lib/python3.7/dist-packages/gdown/cli.py:131: FutureWarning: Option `--id` was deprecated in version 4.3.1 and will be removed in 5.0. You don't need to pass it anymore to use a file ID.\n",
            "  category=FutureWarning,\n",
            "Downloading...\n",
            "From: https://drive.google.com/uc?id=1BKma83La6Cx3m1rWmnQScHjTWj-z65NP\n",
            "To: /content/data.zip\n",
            "100% 517M/517M [00:05<00:00, 90.6MB/s]\n"
          ]
        }
      ],
      "source": [
        "!gdown --id '1BKma83La6Cx3m1rWmnQScHjTWj-z65NP' --output data.zip"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "87YmVQWf6_0n",
        "outputId": "d07d229e-4760-48cb-a97f-b126d80eaf96",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Archive:  data.zip\n",
            "  inflating: test.ft.txt.bz2         \n",
            "  inflating: train.ft.txt.bz2        \n"
          ]
        }
      ],
      "source": [
        "!unzip data.zip"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "l6pid1tY6_0o",
        "outputId": "4e23b4b6-1e5f-44d8-b09e-f9a10cf4af6e",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "b'__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^\\n'\n"
          ]
        }
      ],
      "source": [
        "train_file = bz2.BZ2File('train.ft.txt.bz2')\n",
        "test_file = bz2.BZ2File('test.ft.txt.bz2')\n",
        "train_file = train_file.readlines()\n",
        "test_file = test_file.readlines()\n",
        "print(train_file[0])"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "SAsgBBHj6_0o",
        "outputId": "359aba04-75cc-4cb9-dec2-cf8d9cd059d9",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "从上面打印的数据可以看到，每条数据由两部分组成，*Label*和*Data*。其中：\n",
        "\n",
        "- `__label__1` 代表差评，之后将其编码为0\n",
        "- `__label__2` 代表好评，之后将其编码为1\n",
        "\n",
        "由于数据量太大，所以这里只取100w条记录进行训练，训练集和测试集按照*8:2*进行拆分："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "_BxmKkli6_0o"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "outputs": [],
      "source": [
        "num_train = 800000\n",
        "num_test = 200000\n",
        "\n",
        "train_file = [x.decode('utf-8') for x in train_file[:num_train]]\n",
        "test_file = [x.decode('utf-8') for x in test_file[:num_test]]"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "uL3MhKtc6_0p"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "> 这里使用decode('utf-8')是因为源文件是以二进制类型存储的，从上面的`b''`可以看出\n",
        "\n",
        "源文件中，数据和标签是在一起的，所以要将其拆分开："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "4utoGSoz6_0p"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "outputs": [],
      "source": [
        "# 将__label__1编码为0（差评），__label__2编码为1（好评）\n",
        "train_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in train_file]\n",
        "test_labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test_file]\n",
        "\n",
        "\"\"\"\n",
        "`split(' ', 1)[1]`：将label和data分开后，获取data部分\n",
        "`[:-1]`：去掉最后一个字符(\\n)\n",
        "`lower()`: 将其转换为小写，因为区分大小写对情感识别帮助不大，且会增加编码难度\n",
        "\"\"\"\n",
        "train_sentences = [x.split(' ', 1)[1][:-1].lower() for x in train_file]\n",
        "test_sentences = [x.split(' ', 1)[1][:-1].lower() for x in test_file]"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "2JlI_ZXE6_0p"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "在对数据拆分后，对数据进行简单的数据清理：\n",
        "\n",
        "由于数字对情感分类帮助不大，所以这里将所有的数字都转换为0："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "Nk04Vc-y6_0p"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "outputs": [],
      "source": [
        "for i in range(len(train_sentences)):\n",
        "    train_sentences[i] = re.sub('\\d','0',train_sentences[i])\n",
        "\n",
        "for i in range(len(test_sentences)):\n",
        "    test_sentences[i] = re.sub('\\d','0',test_sentences[i])"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "07AVnZC16_0p"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "数据集中还存在包含网站的样本，例如：`Welcome to our website: www.pohabo.com`。对于这种带有网站的样本，网站地址会干扰数据处理，所以一律处理成：`Welcome to our website: <url>`："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "abM6rblh6_0q"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "outputs": [],
      "source": [
        "for i in range(len(train_sentences)):\n",
        "    if 'www.' in train_sentences[i] or 'http:' in train_sentences[i] or 'https:' in train_sentences[i] or '.com' in train_sentences[i]:\n",
        "        train_sentences[i] = re.sub(r\"([^ ]+(?<=\\.[a-z]{3}))\", \"<url>\", train_sentences[i])\n",
        "\n",
        "for i in range(len(test_sentences)):\n",
        "    if 'www.' in test_sentences[i] or 'http:' in test_sentences[i] or 'https:' in test_sentences[i] or '.com' in test_sentences[i]:\n",
        "        test_sentences[i] = re.sub(r\"([^ ]+(?<=\\.[a-z]{3}))\", \"<url>\", test_sentences[i])"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "oiHNzDE56_0q"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "数据清理结束后，我们需要将**文本进行分词**，并**将仅出现一次的单词丢掉**，因为它们参考价值不大："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "pVoDvGR16_0q"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "nltk.download('punkt') # 使用nltk.work_tokenize前，需要下载`punkt`"
      ],
      "metadata": {
        "id": "0SerO5AE7-pw",
        "outputId": "6337ca5e-754a-410a-bdee-8bbda9f3e0b5",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.0% done\n",
            "25.0% done\n",
            "50.0% done\n",
            "75.0% done\n",
            "100% done\n"
          ]
        }
      ],
      "source": [
        "words = Counter() # 用于统计每个单词出现的次数\n",
        "for i, sentence in enumerate(train_sentences):\n",
        "    words_list = nltk.word_tokenize(sentence) # 将句子进行分词\n",
        "    words.update(words_list)  # 更新词频列表\n",
        "    train_sentences[i] = words_list # 分词后的单词列表存在该列表中\n",
        "\n",
        "    if i % 20000 == 0: # 每2w打印一次进度\n",
        "        print(str((i*100)/num_train) + \"% done\")\n",
        "print(\"100% done\")"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "uUIDMkDt6_0q",
        "outputId": "b3db9b70-b558-4f14-9278-e8ef4f6a3872",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "移除仅出现一次的单词："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "iWS8bm1g6_0q"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "outputs": [],
      "source": [
        "words = {k:v for k,v in words.items() if v>1}"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "AeW4L89X6_0r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "将words按照出现次数由大到小排序，并转换为list，**作为我们的词典**，之后**对于单词的编码会基于该词典**："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "F_twWbLr6_0r"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "['.', 'the', ',', 'i', 'and', 'a', 'to', 'it', 'of', 'this']\n"
          ]
        }
      ],
      "source": [
        "words = sorted(words, key=words.get,reverse=True)\n",
        "print(words[:10]) # 打印一下出现次数最多的10个单词"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "7xGQFepO6_0r",
        "outputId": "2e1e1eb0-9f5d-41c4-a2ef-784e9d849b30",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "向词典中增加一个单词：\n",
        "\n",
        "- `_PAD`：表示填充，因为后续会固定所有句子长度。过长的句子进行阶段，过短的句子使用该单词进行填充"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "R-kILnqA6_0r"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "outputs": [],
      "source": [
        "words = ['_PAD'] + words"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "mkBOi5Np6_0r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "整理好词典后，对**单词进行编码**，即**将单词映射成数字**，这里直接使用单词所在的数字下表作为单词的编码值："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "n35FzP_M6_0r"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "outputs": [],
      "source": [
        "word2idx = {o:i for i,o in enumerate(words)}\n",
        "idx2word = {i:o for i,o in enumerate(words)}"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "NBWVkmM_6_0r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "映射字典准备完毕后，就可以将`train_sentences`中存储的单词转化为数字了："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "FieIwn2u6_0r"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "outputs": [],
      "source": [
        "for i, sentence in enumerate(train_sentences):\n",
        "    train_sentences[i] = [word2idx[word] if word in word2idx else 0 for word in sentence]\n",
        "\n",
        "for i, sentence in enumerate(test_sentences):\n",
        "    test_sentences[i] = [word2idx[word.lower()] if word.lower() in word2idx else 0 for word in nltk.word_tokenize(sentence)]"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "mOCna9Q56_0r"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "> 上面的`else 0`表示：如果单词没有在字典中出现过，则使用编码0，对应上面的`_PAD`\n",
        "\n",
        "为了方便构建模型，需要固定所有句子的长度，这里选择200作为句子的固定长度，对于长度不够的句子，在前面填充`0`(`_PAD`)，超出长度的句子进行从后面截断："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "G95_HFbD6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "outputs": [],
      "source": [
        "def pad_input(sentences, seq_len):\n",
        "    \"\"\"\n",
        "    将句子长度固定为`seq_len`，超出长度的从后面阶段，长度不足的在前面补0\n",
        "    \"\"\"\n",
        "    features = np.zeros((len(sentences), seq_len),dtype=int)\n",
        "    for ii, review in enumerate(sentences):\n",
        "        if len(review) != 0:\n",
        "            features[ii, -len(review):] = np.array(review)[:seq_len]\n",
        "    return features\n",
        "\n",
        "# 固定测试数据集和训练数据集的句子长度\n",
        "train_sentences = pad_input(train_sentences, 200)\n",
        "test_sentences = pad_input(test_sentences, 200)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "chNfmNcv6_0s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "上述方法除了固定长度外，还顺便将数字转化为了numpy数组。Label数据集也需要转换一下："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "nqRUKb6s6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "outputs": [],
      "source": [
        "train_labels = np.array(train_labels)\n",
        "test_labels = np.array(test_labels)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "01OAgkzX6_0s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "到这里，数据预处理的工作基本完成，接下来该PyTorch登场了"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "-SKNTwas6_0s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 模型构建\n",
        "\n",
        "首先导出Pytorch需要用到的包"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "bJWix0sq6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "outputs": [],
      "source": [
        "import torch\n",
        "from torch.utils.data import TensorDataset, DataLoader\n",
        "import torch.nn as nn"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "CSdCaxjs6_0s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "构建训练数据集和测试数据集的DataLoader，同时**定义BatchSize为200**:"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "ngzQUMlL6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "outputs": [],
      "source": [
        "batch_size = 200\n",
        "\n",
        "train_data = TensorDataset(torch.from_numpy(train_sentences), torch.from_numpy(train_labels))\n",
        "test_data = TensorDataset(torch.from_numpy(test_sentences), torch.from_numpy(test_labels))\n",
        "\n",
        "train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)\n",
        "test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "rB8exua36_0s"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "如果有条件，建议使用显卡来加速计算："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "vR-y2-by6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "outputs": [],
      "source": [
        "device = torch.device('cuda') if torch.cuda.is_available() else torch.device(\"cpu\")"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "-ltzuAyQ6_0s"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "outputs": [],
      "source": [
        "class SentimentNet(nn.Module):\n",
        "    def __init__(self, vocab_size):\n",
        "        super(SentimentNet, self).__init__()\n",
        "        self.n_layers = n_layers = 2 # LSTM的层数\n",
        "        self.hidden_dim = hidden_dim = 512 # 隐状态的维度，即LSTM输出的隐状态的维度为512\n",
        "        embedding_dim = 400 # 将单词编码成400维的向量\n",
        "        drop_prob=0.5 # dropout\n",
        "\n",
        "        # 定义embedding，负责将数字编码成向量，详情可参考：https://blog.csdn.net/zhaohongfei_358/article/details/122809709\n",
        "        self.embedding = nn.Embedding(vocab_size, embedding_dim)\n",
        "\n",
        "        self.lstm = nn.LSTM(embedding_dim, # 输入的维度\n",
        "                            hidden_dim, # LSTM输出的hidden_state的维度\n",
        "                            n_layers, # LSTM的层数\n",
        "                            dropout=drop_prob,\n",
        "                            batch_first=True # 第一个维度是否是batch_size\n",
        "                           )\n",
        "\n",
        "\n",
        "\n",
        "        # LSTM结束后的全连接线性层\n",
        "        self.fc = nn.Linear(in_features=hidden_dim, # 将LSTM的输出作为线性层的输入\n",
        "                            out_features=1 # 由于情感分析只需要输出0或1，所以输出的维度是1\n",
        "                            )\n",
        "        self.sigmoid = nn.Sigmoid() # 线性层输出后，还需要过一下sigmoid\n",
        "\n",
        "        # 给最后的全连接层加一个Dropout\n",
        "        self.dropout = nn.Dropout(drop_prob)\n",
        "\n",
        "    def forward(self, x, hidden):\n",
        "        \"\"\"\n",
        "        x: 本次的输入，其size为(batch_size, 200)，200为句子长度\n",
        "        hidden: 上一时刻的Hidden State和Cell State。类型为tuple: (h, c),\n",
        "        其中h和c的size都为(n_layers, batch_size, hidden_dim), 即(2, 200, 512)\n",
        "        \"\"\"\n",
        "        # 因为一次输入一组数据，所以第一个维度是batch的大小\n",
        "        batch_size = x.size(0)\n",
        "\n",
        "        # 由于embedding只接受LongTensor类型，所以将x转换为LongTensor类型\n",
        "        x = x.long()\n",
        "\n",
        "        # 对x进行编码，这里会将x的size由(batch_size, 200)转化为(batch_size, 200, embedding_dim)\n",
        "        embeds = self.embedding(x)\n",
        "\n",
        "        # 将编码后的向量和上一时刻的hidden_state传给LSTM，并获取本次的输出和隐状态（hidden_state, cell_state）\n",
        "        # lstm_out的size为 (batch_size, 200, 512)，200是单词的数量，由于是一个单词一个单词送给LSTM的，所以会产生与单词数量相同的输出\n",
        "        # hidden为tuple(hidden_state, cell_state)，它们俩的size都为(2, batch_size, 512), 2是由于lstm有两层。由于是所有单词都是共享隐状态的，所以并不会出现上面的那个200\n",
        "        lstm_out, hidden = self.lstm(embeds, hidden)\n",
        "\n",
        "        # 接下来要过全连接层，所以size变为(batch_size * 200, hidden_dim)，\n",
        "        # 之所以是batch_size * 200=40000，是因为每个单词的输出都要经过全连接层。\n",
        "        # 换句话说，全连接层的batch_size为40000\n",
        "        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)\n",
        "\n",
        "        # 给全连接层加个Dropout\n",
        "        out = self.dropout(lstm_out)\n",
        "\n",
        "        # 将dropout后的数据送给全连接层\n",
        "        # 全连接层输出的size为(40000, 1)\n",
        "        out = self.fc(out)\n",
        "\n",
        "        # 过一下sigmoid\n",
        "        out = self.sigmoid(out)\n",
        "\n",
        "        # 将最终的输出数据维度变为 (batch_size, 200)，即每个单词都对应一个输出\n",
        "        out = out.view(batch_size, -1)\n",
        "\n",
        "        # 只去最后一个单词的输出\n",
        "        # 所以out的size会变为(200, 1)\n",
        "        out = out[:,-1]\n",
        "\n",
        "        # 将输出和本次的(h, c)返回\n",
        "        return out, hidden\n",
        "\n",
        "    def init_hidden(self, batch_size):\n",
        "        \"\"\"\n",
        "        初始化隐状态：第一次送给LSTM时，没有隐状态，所以要初始化一个\n",
        "        这里的初始化策略是全部赋0。\n",
        "        这里之所以是tuple，是因为LSTM需要接受两个隐状态hidden state和cell state\n",
        "        \"\"\"\n",
        "        hidden = (torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device),\n",
        "                  torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device)\n",
        "                 )\n",
        "        return hidden"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "MvOKKm2a6_0t"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "模型定义完毕，构建模型对象："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "XMh1JBZL6_0t"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "SentimentNet(\n",
              "  (embedding): Embedding(221604, 400)\n",
              "  (lstm): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5)\n",
              "  (fc): Linear(in_features=512, out_features=1, bias=True)\n",
              "  (sigmoid): Sigmoid()\n",
              "  (dropout): Dropout(p=0.5, inplace=False)\n",
              ")"
            ]
          },
          "metadata": {},
          "execution_count": 25
        }
      ],
      "source": [
        "model = SentimentNet(len(words))\n",
        "model.to(device)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "pqZmoWar6_0t",
        "outputId": "12839ec8-c4b3-4d86-c4b7-aff71dc7c066",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "接下来定义损失函数，由于是二分类问题，所以使用**交叉熵（Binary Cross Entropy，BCE）**："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "sb3eb6bZ6_0t"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "outputs": [],
      "source": [
        "criterion = nn.BCELoss()"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "Wg2L1kOQ6_0t"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "优化器选用Adam优化器："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "bvvd1SYa6_0t"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "outputs": [],
      "source": [
        "lr = 0.005\n",
        "optimizer = torch.optim.Adam(model.parameters(), lr=lr)"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "_6P6YNYM6_0t"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "接下来定义训练代码："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "sY7N__vE6_0t"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Epoch: 1/2... Step: 1000... Loss: 0.268714...\n",
            "Epoch: 1/2... Step: 2000... Loss: 0.187919...\n",
            "Epoch: 1/2... Step: 3000... Loss: 0.215379...\n",
            "Epoch: 1/2... Step: 4000... Loss: 0.195820...\n",
            "Epoch: 2/2... Step: 5000... Loss: 0.130096...\n",
            "Epoch: 2/2... Step: 6000... Loss: 0.110538...\n",
            "Epoch: 2/2... Step: 7000... Loss: 0.198314...\n",
            "Epoch: 2/2... Step: 8000... Loss: 0.233867...\n"
          ]
        }
      ],
      "source": [
        "epochs = 2 # 一共训练两轮\n",
        "counter = 0 # 用于记录训练次数\n",
        "print_every = 1000 # 每1000次打印一下当前状态\n",
        "\n",
        "for i in range(epochs):\n",
        "    h = model.init_hidden(batch_size) # 初始化第一个Hidden_state\n",
        "\n",
        "    for inputs, labels in train_loader: # 从train_loader中获取一组inputs和labels\n",
        "        counter += 1 # 训练次数+1\n",
        "\n",
        "        # 将上次输出的hidden_state转为tuple格式\n",
        "        # 因为有两次，所以len(h)==2\n",
        "        h = tuple([e.data for e in h])\n",
        "\n",
        "        # 将数据迁移到GPU\n",
        "        inputs, labels = inputs.to(device), labels.to(device)\n",
        "\n",
        "        # 清空模型梯度\n",
        "        model.zero_grad()\n",
        "\n",
        "        # 将本轮的输入和hidden_state送给模型，进行前向传播，\n",
        "        # 然后获取本次的输出和新的hidden_state\n",
        "        output, h = model(inputs, h)\n",
        "\n",
        "        # 将预测值和真实值送给损失函数计算损失\n",
        "        loss = criterion(output, labels.float())\n",
        "\n",
        "        # 进行反向传播\n",
        "        loss.backward()\n",
        "\n",
        "        # 对模型进行裁剪，防止模型梯度爆炸\n",
        "        # 详情请参考：https://blog.csdn.net/zhaohongfei_358/article/details/122820992\n",
        "        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)\n",
        "\n",
        "        # 更新权重\n",
        "        optimizer.step()\n",
        "\n",
        "        # 隔一定次数打印一下当前状态\n",
        "        if counter%print_every == 0:\n",
        "            print(\"Epoch: {}/{}...\".format(i+1, epochs),\n",
        "                  \"Step: {}...\".format(counter),\n",
        "                  \"Loss: {:.6f}...\".format(loss.item()))"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "Gzc82i5j6_0t",
        "outputId": "3c74bd52-f2f9-4e76-84bc-f667605b86dd",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "> 如果这里抛出了`RuntimeError: CUDA out of memory. Tried to allocate ...`异常，可以将batch_size调小，或者清空gpu中的缓存（`torch.cuda.empty_cache()`）"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "bDVxnKpc6_0t"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "经过一段时间的训练，现在来评估一下模型的性能："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "AZSIMcDN6_0t"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Test loss: 0.202\n",
            "Test accuracy: 92.487%\n"
          ]
        }
      ],
      "source": [
        "test_losses = [] # 记录测试数据集的损失\n",
        "num_correct = 0 # 记录正确预测的数量\n",
        "h = model.init_hidden(batch_size) # 初始化hidden_state和cell_state\n",
        "model.eval() # 将模型调整为评估模式\n",
        "\n",
        "# 开始评估模型\n",
        "for inputs, labels in test_loader:\n",
        "    h = tuple([each.data for each in h])\n",
        "    inputs, labels = inputs.to(device), labels.to(device)\n",
        "    output, h = model(inputs, h)\n",
        "    test_loss = criterion(output.squeeze(), labels.float())\n",
        "    test_losses.append(test_loss.item())\n",
        "    pred = torch.round(output.squeeze()) # 将模型四舍五入为0和1\n",
        "    correct_tensor = pred.eq(labels.float().view_as(pred)) # 计算预测正确的数据\n",
        "    correct = np.squeeze(correct_tensor.cpu().numpy())\n",
        "    num_correct += np.sum(correct)\n",
        "\n",
        "print(\"Test loss: {:.3f}\".format(np.mean(test_losses)))\n",
        "test_acc = num_correct/len(test_loader.dataset)\n",
        "print(\"Test accuracy: {:.3f}%\".format(test_acc*100))"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "fa8y_0t86_0t",
        "outputId": "4d613e01-71d7-4347-fcfc-d926143678ae",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "最终，经过训练后，可以得到90%以上的准确率。\n",
        "\n",
        "我们来实际尝试一下，定义一个`predict(sentence)`函数，输入一个句子，输出其预测结果："
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "lT3opySB6_0u"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "outputs": [],
      "source": [
        "def predict(sentence):\n",
        "    # 将句子分词后，转换为数字\n",
        "    sentences = [[word2idx[word.lower()] if word.lower() in word2idx else 0 for word in nltk.word_tokenize(sentence)]]\n",
        "\n",
        "    # 将句子变为固定长度200\n",
        "    sentences = pad_input(sentences, 200)\n",
        "\n",
        "    # 将数据移到GPU中\n",
        "    sentences = torch.Tensor(sentences).long().to(device)\n",
        "\n",
        "    # 初始化隐状态\n",
        "    h = (torch.Tensor(2, 1, 512).zero_().to(device),\n",
        "         torch.Tensor(2, 1, 512).zero_().to(device))\n",
        "    h = tuple([each.data for each in h])\n",
        "\n",
        "    # 预测\n",
        "    if model(sentences, h)[0] >= 0.5:\n",
        "        print(\"positive\")\n",
        "    else:\n",
        "        print(\"negative\")"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "q56TaJZO6_0u"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "negative\n",
            "negative\n"
          ]
        }
      ],
      "source": [
        "predict(\"The film is so boring\")\n",
        "predict(\"The actor is too ugly.\")"
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "qil4MtPi6_0u",
        "outputId": "4df7aab6-0d64-493f-ee03-374f30856f44",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 参考资料\n",
        "\n",
        "[Long Short-Term Memory: From Zero to Hero with PyTorch](https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/): https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/"
      ],
      "metadata": {
        "collapsed": false,
        "pycharm": {
          "name": "#%% md\n"
        },
        "id": "1fQ0sKb26_0u"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "source": [
        ""
      ],
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "gL4V6ekz6_0u"
      }
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 2
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython2",
      "version": "2.7.6"
    },
    "colab": {
      "provenance": [],
      "machine_shape": "hm"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "nbformat": 4,
  "nbformat_minor": 0
}